19:23
2026-06-04
lesswrong.com
large-language-models
(Mis)generalization of Helpful-Only Fine-tuning
Researchers studying helpful-only (H-only) large language models found that existing models exhibit emergent misalignment, residual refusal behaviors, poor steerability, sycophancy, and incoherent cha…